The purpose of this project is to provide an easily consumable skill rating for professional Call of Duty players. This rating can serve as a descriptive statistic, but it can also be used to predict the winner of the largest tournament of the year, Call of Duty Champs.
Call of Duty is a first-person shooter franchise that began in 2003. Since then, it has become one of the largest multiplayer video game franchises in existence, and a competitive scene has gained traction alongside it. In 2016, the Call of Duty World League was born – a sponsored league that hosts major tournaments throughout the year for the best players in the world. In these events, the pros play three different game modes to decide the winner of a series: Hardpoint, Search and Destroy, and a third game mode that often changes yearly. For the data covered here, the third game mode is Control. Every team in the league consists of 5 players, and each series is a best of 5.
embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s")
In Hardpoint, the two teams fight over a rotating objective on the map called the “hardpoint”: for every second a team spends holding it, that team gains one point. If both teams are in the hardpoint at the same time, neither team collects points. Every sixty seconds the hardpoint changes location on the map, so teams must make tactical decisions about how to rotate across the map. The first team to 250 points wins the map.
embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
use_start_time(6*60 + 35)
In Search and Destroy, the two teams play rounds in which each player has only one life; if you die, you are out until the next round. A team wins a round by eliminating the entire opposing team before the time limit, or, on offense, by planting the bomb and having it detonate (45 seconds after the plant). The first team to win 6 rounds wins the map.
embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
use_start_time(18*60 + 39)
In Control, there is an offense team and a defense team, and the teams switch sides between rounds. Each team has 30 lives per round. The offensive team tries to either capture two points on the map or eliminate all 30 of the other team’s lives; the defensive team tries to either hold the two points until time runs out or eliminate all 30 of the other team’s lives. The first team to win three rounds wins the map.
embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
use_start_time(45*60 + 40)
This project makes use of official CWL data published by Activision on GitHub. The data is organized relatively cleanly, and all missing data is flagged.
proleague2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-07-05-proleague.csv"))
fortworth2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-03-17-fortworth.csv"))
london2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-05-05-london.csv"))
anaheim2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-06-16-anaheim.csv"))
proleagueFinals2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-07-21-proleague-finals.csv"))
# all stats for all major tournaments (EXCEPT CHAMPS) in BO4 (2019)
majors2019 <- rbind(proleague2019, fortworth2019, london2019, anaheim2019, proleagueFinals2019)
# champs will act as our test data; we will try and predict the winner
champs2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-08-18-champs.csv"))
In order to assign an overall score to each individual player, we will need to address Hardpoint, Search and Destroy, and Control separately. Once we have an individual score for each of the three game modes, we can combine them into a final score.
Hardpoint:
1. player – which player the data corresponds to
2. mode – game mode
3. win – ‘W’ or ‘L’; used to find each player’s overall win/loss ratio
4. k_d – kill/death ratio; used to show overall impact on the map
5. assists – in addition to k/d, assists show overall support on the map; higher assists can indicate better teamwork
6. accuracy_percent – player accuracy for each match
7. damage_dealt – total damage done on the map
8. player_spm – score per minute
9. hill_time_s – hill time measured in seconds
10. hill_captures – shows activity on the map (MIGHT INCLUDE)
11. hill_defends – shows activity on the map (MIGHT INCLUDE)
12. match_id – helpful for removing missing data
Search and Destroy:
1. player – which player the data corresponds to
2. mode – game mode
3. win – ‘W’ or ‘L’; used to find each player’s overall win/loss ratio
4. k_d – kill/death ratio; used to show overall impact on the map
5. assists – in addition to k/d, assists show overall support on the map; higher assists can indicate better teamwork
6. accuracy_percent – player accuracy for each match
7. damage_dealt – total damage done on the map
8. player_spm – score per minute
9. fb_round_ratio – ‘snd_firstbloods’/‘snd_rounds’ (NOT INCLUDED IN BASE DATA SET)
10. bomb_sneak_defuses – sneak defuses often happen in pivotal rounds
11. bomb_plants – good indicator of role (MIGHT INCLUDE)
12. bomb_defuses – good indicator of role (MIGHT INCLUDE)
13. match_id – helpful for removing missing data
Control:
1. player – which player the data corresponds to
2. mode – game mode
3. win – ‘W’ or ‘L’; used to find each player’s overall win/loss ratio
4. k_d – kill/death ratio; used to show overall impact on the map
5. assists – in addition to k/d, assists show overall support on the map; higher assists can indicate better teamwork
6. accuracy_percent – player accuracy for each match
7. damage_dealt – total damage done on the map
8. player_spm – score per minute
9. match_id – helpful for removing missing data
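Once the raw names are standardized with `clean_names()` in the cleaning step below, a quick check along these lines can confirm the planned columns actually exist. This is only a sketch, and `planned_vars` lists just a subset of the variables named above:

```r
# hypothetical check: are any planned columns missing after clean_names()?
planned_vars <- c("player", "mode", "win", "k_d", "assists", "accuracy_percent",
                  "damage_dealt", "player_spm", "hill_time_s", "match_id")
setdiff(planned_vars, names(majors2019))  # character(0) means all are present
```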
The data below covers all of the majors throughout the season except COD Champs, which we reserve as a test set. The raw data from each major is merged into one dataset, which is then broken up into Hardpoint, SND, and Control datasets.
# CLEANING
majors2019 <- majors2019 %>% clean_names(.)
# new dataset that contains all of the missing data, just in case
majors2019_missing <- sqldf('SELECT * FROM majors2019 WHERE match_id LIKE "missing%"')
# whole event data, all players and all maps, where player names are organized alphabetically
majors2019 <- majors2019[order(majors2019$player),]
# removes missing values
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE match_id NOT LIKE "missing%"')
# calculates all the players that have played more than 50 games
player_numgames <- count(majors2019, player) %>% subset(., n > 50) %>% subset(select = -n)
# includes all existing data for all players that have played more than 50 games (arbitrary cutoff)
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE player IN (SELECT player FROM player_numgames)')
# removes all matches where damage = 0; almost always occurs as a result of data loss
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE damage_dealt != "0"')
# changes W to 1, L to 0
majors2019$win <- ifelse(majors2019$win == "W", 1, 0) %>%
as.factor()
# assigning a role to each player to allow for more precise comparisons
playerRoles <- majors2019 %>%
group_by(player) %>%
count(player, fave_weapon) %>%
top_n(1, n) %>%
mutate(role = fave_weapon) %>%
subset(select = -c(fave_weapon, n))
# replace fav gun with corresponding role
playerRoles$role <- str_replace(playerRoles$role, "Saug 9mm", "1")
playerRoles$role <- str_replace(playerRoles$role, "Maddox RFB", "2")
playerRoles$role <- str_replace(playerRoles$role, "ICR-7", "3")
# making factors
playerRoles$role <- factor(playerRoles$role)
# manual adjustment for player TJHaly
playerRoles <- playerRoles[-c(83), ]
majors2019 <- dplyr::inner_join(playerRoles, majors2019, by = "player")
A player’s role is defined as a sub (1), a flex (2), or an AR (3), based on his most-used weapon.
# all 2019 hardpoint data
hp2019 <- sqldf('SELECT player, k_d, role, win, kills, deaths, x, assists, damage_dealt, player_spm, hill_time_s, hill_captures, hill_defends, x2_piece, x3_piece, x4_piece FROM majors2019 WHERE mode == "Hardpoint"')
hp2019 <- hp2019[order(hp2019$player),]
# all 2019 SND data
snd2019 <- sqldf('SELECT match_id, team, player, role, win, kills, deaths, k_d, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_rounds, snd_firstbloods, snd_1_kill_round, snd_2_kill_round, snd_3_kill_round, snd_4_kill_round, x2_piece, x3_piece, x4_piece FROM majors2019 WHERE mode == "Search & Destroy"')
# adds new column with fb/round ratio
snd2019 <- add_column(snd2019, fb_round_ratio = snd2019$snd_firstbloods/snd2019$snd_rounds)
# adding a new column with average first bloods for the season
snd2019 <- snd2019 %>%
group_by(player) %>%
mutate(fb_avg = mean(snd_firstbloods))
# puts data in alphabetical order
snd2019 <- snd2019[order(snd2019$player),]
# all 2019 CONTROL data
control2019 <- sqldf('SELECT player, role, win, k_d, assists, damage_dealt, player_spm, x2_piece, x3_piece, x4_piece, ctrl_firstbloods, ctrl_firstdeaths, ctrl_captures FROM majors2019 WHERE mode == "Control"')
control2019 <- control2019[order(control2019$player),]
champs2019 <- champs2019 %>% clean_names(.)
champs2019 <- champs2019[order(champs2019$player),]
champs2019 <- sqldf('SELECT * FROM champs2019 WHERE match_id NOT LIKE "missing%"')
champs2019 <- sqldf('SELECT * FROM champs2019 WHERE damage_dealt != "0"')
# changes W to 1, L to 0
champs2019$win <- ifelse(champs2019$win == "W", 1, 0) %>%
as.factor()
champs2019 <- dplyr::inner_join(playerRoles, champs2019, by = "player")
# CHAMPS 2019 hardpoint data
hpChamps <- sqldf('SELECT player, k_d, role, win, kills, deaths, x, assists, damage_dealt, player_spm, hill_time_s, hill_captures, hill_defends FROM champs2019 WHERE mode == "Hardpoint"')
hpChamps <- hpChamps[order(hpChamps$player),]
# CHAMPS 2019 SND data
sndChamps <- sqldf('SELECT player, win, role, k_d, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_rounds, snd_firstbloods FROM champs2019 WHERE mode == "Search & Destroy"')
# adds new column with fb/round ratio
sndChamps <- add_column(sndChamps, fb_round_ratio = sndChamps$snd_firstbloods/sndChamps$snd_rounds)
# adding a new column with average first bloods for the season
sndChamps <- sndChamps %>%
group_by(player) %>%
mutate(fb_avg = mean(snd_firstbloods))
# puts data in alphabetical order
sndChamps <- sndChamps[order(sndChamps$player),]
# CHAMPS 2019 CONTROL data
controlChamps <- sqldf('SELECT player, role, win, k_d, assists, damage_dealt, player_spm FROM champs2019 WHERE mode == "Control"')
controlChamps <- controlChamps[order(controlChamps$player),]
# getting all necessary data for hardpoint
mergedhp2019 <- sqldf('SELECT match_id, team, player, role, kills, deaths, win, assists, damage_dealt, player_spm, hill_captures, hill_defends FROM majors2019 WHERE mode == "Hardpoint"')
# organizing by each match
mergedhp2019 <- mergedhp2019[order(mergedhp2019$match_id),]
# removing all matches that DON'T include all 10 players
# calculates all the matches that have all 10 players
match_numplayers <- count(mergedhp2019, match_id) %>% subset(., n == 10) %>% subset(select = -n)
# includes matches where all 10 players have existing data
mergedhp2019 <- sqldf('SELECT * FROM mergedhp2019 WHERE match_id IN (SELECT match_id FROM match_numplayers)')
# merge rows so that all the players from each team are one row; expect 800 observations with about 50 variables
test_mergedhp2019 <- mergedhp2019 %>%
rename(damage = damage_dealt,
spm = player_spm,
hillcaptures = hill_captures,
hilldefends = hill_defends) %>%
mutate(rn = rowid(match_id, team)) %>%
pivot_wider(names_from = rn, values_from = c(win,
player,
kills,
deaths,
assists,
damage,
spm,
hillcaptures,
hilldefends)) %>%
subset(select = -c(win_2, win_3, win_4, win_5,
player_1, player_2, player_3, player_4, player_5)) %>%
rename(win = win_1)
# team_mergedhp2019 <- test_mergedhp2019 %>%
# group_by(match_id, team) %>%
# mutate(kills = sum(kills_1, kills_2, kills_3, kills_4, kills_5),
# deaths = sum(deaths_1, deaths_2, deaths_3, deaths_4, deaths_5),
# kd = kills/deaths,
# assists = sum(assists_1, assists_2, assists_3, assists_4, assists_5),
# spm = mean(spm_1, spm_2, spm_3, spm_4, spm_5),
# hillcaptures = sum(hillcaptures_1, hillcaptures_2, hillcaptures_3, hillcaptures_4, hillcaptures_5),
# hilldefends = sum(hilldefends_1, hilldefends_2, hilldefends_3, hilldefends_4, hilldefends_5),
# damage = sum(damage_1, damage_2, damage_3, damage_4, damage_5)) %>%
# subset(select = c(win, kd, assists, spm, hillcaptures, hilldefends, damage))
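After the reshape, each Hardpoint match should contribute exactly two rows (one per team), so a check along these lines can catch malformed matches. This is only a sketch against the `test_mergedhp2019` frame built above:

```r
# every match_id should now appear exactly twice (once per team)
rows_per_match <- count(test_mergedhp2019, match_id)
all(rows_per_match$n == 2)
```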
# getting all necessary data for search and destroy
team_snd2019 <- sqldf('SELECT match_id, k_d, role, team, player, win, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_firstbloods, snd_rounds FROM snd2019')
# organizing by each match
team_snd2019 <- team_snd2019[order(team_snd2019$match_id),]
# removing all matches that DON'T include all 10 players
# calculates all the matches that have all 10 players
match_numplayers <- count(team_snd2019, match_id) %>% subset(., n == 10) %>% subset(select = -n)
# includes matches where all 10 players have existing data
team_snd2019 <- sqldf('SELECT * FROM team_snd2019 WHERE match_id IN (SELECT match_id FROM match_numplayers)')
# merge rows so that all the players from each team are one row; expect 800 observations with about 50 variables
team_snd2019 <- team_snd2019 %>%
rename(kd = k_d,
damage = damage_dealt,
spm = player_spm,
fb = snd_firstbloods,
rounds = snd_rounds,
defuses = bomb_defuses,
plants = bomb_plants,
nd = bomb_sneak_defuses) %>%
mutate(rn = rowid(match_id, team)) %>%
pivot_wider(names_from = rn, values_from = c(win,
player,
kd,
role,
assists,
damage,
spm,
fb,
rounds,
defuses,
plants,
nd)) %>%
subset(select = -c(win_2, win_3, win_4, win_5,
player_1, player_2, player_3, player_4, player_5,
rounds_2, rounds_3, rounds_4, rounds_5,
match_id, team)) %>%
rename(win = win_1) %>%
rename(rounds = rounds_1)
# team_snd2019 <- team_snd2019 %>%
# group_by(match_id, team) %>%
# mutate(kills = sum(kills_1, kills_2, kills_3, kills_4, kills_5),
# deaths = sum(deaths_1, deaths_2, deaths_3, deaths_4, deaths_5),
# kd = kills/deaths,
# assists = sum(assists_1, assists_2, assists_3, assists_4, assists_5),
# spm = mean(spm_1, spm_2, spm_3, spm_4, spm_5),
# damage = sum(damage_1, damage_2, damage_3, damage_4, damage_5),
# fb = sum(fb_1, fb_2, fb_3, fb_4, fb_5),
# fbratio = fb/rounds_1,
# plants = sum(plants_1, plants_2, plants_3, plants_4, plants_5),
# defuses = sum(defuses_1, defuses_2, defuses_3, defuses_4, defuses_5),
# nd = sum(nd_1, nd_2, nd_3, nd_4, nd_5)) %>%
# subset(select = c(win, role, kd, kills, deaths, assists, rounds_1, damage, fbratio, plants, defuses, nd))
For my exploratory data analysis, I will use only the regular-season data; it will not include the Champs data.
ggplot(majors2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "OVERALL Player K/D's, 2019 Season (BO4), Descending")
ggplot(hp2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "Player K/D's for HARDPOINT, 2019 Season (BO4), Descending")
ggplot(snd2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "Player K/D's for SEARCH AND DESTROY, 2019 Season (BO4), Descending")
ggplot(control2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "Player K/D's for CONTROL, 2019 Season (BO4), Descending")
Search and Destroy is a game mode with multiple rounds in which every player has only one life per round. A “first blood” is the first kill of a round and is usually highly influential. This is a common stat among commentators and the community.
# player firstblood average for SND 2019
ggplot(snd2019, aes(x = reorder(player, fb_avg), y = fb_avg)) + geom_point() + coord_flip(ylim = c(0, 3)) + labs(y = "Firstblood Average", x = "Player", subtitle = "Player Firstblood Average for SEARCH AND DESTROY, 2019 Season (BO4), Descending")
# player firstbloods for SND 2019
ggplot(snd2019, aes(x = reorder(player, snd_firstbloods), y = snd_firstbloods)) + geom_boxplot() + coord_flip(ylim = c(0, 6)) + labs(y = "Firstbloods", x = "Player", subtitle = "Player Firstbloods for SEARCH AND DESTROY, 2019 Season (BO4), Descending")
# player firstblood/round for SND 2019
ggplot(snd2019, aes(x = reorder(player, fb_round_ratio), y = fb_round_ratio)) + geom_boxplot() + coord_flip(ylim = c(0, 0.6)) + labs(y = "Firstblood/round ratio", x = "Player", subtitle = "Player Firstblood/Round for SEARCH AND DESTROY, 2019 Season (BO4), Descending")
# player damage dealt OVERALL 2019
playerDamage <- sqldf('SELECT player, damage_dealt FROM majors2019 WHERE damage_dealt != "0"')
ggplot(playerDamage, aes(x = reorder(player, damage_dealt), y = damage_dealt)) + geom_boxplot() + coord_flip(ylim = c(0, 10000)) + labs(y = "Damage Dealt", x = "Player", subtitle = "OVERALL Player Damage Dealt, 2019 Season (BO4), Descending")
# Overall score per minute for 2019 season
ggplot(majors2019, aes(x = reorder(player, player_spm), y = player_spm)) + geom_boxplot() + coord_flip(ylim = c(0, 675)) + labs(y = "Score per minute", x = "Player", subtitle = "OVERALL Player Score per minute, 2019 Season (BO4), Descending")
# Overall number of wins for 2019 season
playerwins <- sqldf('SELECT player, win FROM majors2019 WHERE win == "1"') # selects all the wins for each player
playerwins <- playerwins %>% count(player) # counts the number of wins per player
ggplot(playerwins, aes(x = reorder(player, n), y = n)) + geom_bar(stat = 'identity') + coord_flip() + labs(y = "Number of Wins", x = "Player", subtitle = "OVERALL Number of Wins per Player, 2019 Season (BO4), Descending")
The top 4 players with the most wins in the season are Slasher, Octane, Kenny, and Enable. Interestingly, all of these players were on the same team, 100 Thieves. They each tied with 116 wins during the season.
playerwins %>%
ggplot(aes(x = n)) + geom_histogram(binwidth = 15, color = "black", fill = "white")
The number of wins appears to follow a roughly normal distribution. The left side of the histogram appears slightly more populated, but I hypothesize that this is due to players who did not play for the whole season.
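This hypothesis could be probed with a formal normality test; the following is only a sketch, since with relatively few players a visual check is often just as informative:

```r
# Shapiro-Wilk test on the per-player win counts; a large p-value is
# consistent with (but does not prove) normality
shapiro.test(playerwins$n)
```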
I will be trying to predict whether an individual player will win or lose a game based on his statistics in the given game.
Splitting Data:
hp2019_wl <- hp2019
set.seed(3068)
hp2019_wlsplit <- hp2019_wl %>%
initial_split(prop = 0.8, strata = "win")
hp2019_train <- training(hp2019_wlsplit)
hp2019_test <- testing(hp2019_wlsplit)
head(hp2019_train)
## player k_d role win kills deaths x assists damage_dealt player_spm
## 4 Abezy 0.66 1 0 19 29 -10 6 3891 290.9
## 5 Abezy 1.18 1 0 26 22 4 8 4480 393.3
## 12 Abezy 0.88 1 0 22 25 -3 14 4515 322.3
## 15 Abezy 0.76 1 0 19 25 -6 8 4868 295.7
## 16 Abezy 0.83 1 0 20 24 -4 4 3954 269.5
## 20 Abezy 1.33 1 0 28 21 7 7 4733 400.3
## hill_time_s hill_captures hill_defends x2_piece x3_piece x4_piece
## 4 48 4 5 2 0 0
## 5 55 5 14 2 1 0
## 12 80 9 6 4 0 0
## 15 77 7 11 5 0 0
## 16 35 3 6 2 2 0
## 20 71 6 10 3 0 0
dim(hp2019_train)
## [1] 3551 16
dim(hp2019_test)
## [1] 889 16
prop.table(table(hp2019_train$win))
##
## 0 1
## 0.4987328 0.5012672
Creating a general decision tree specification using rpart:
hp_tree_spec <- decision_tree() %>%
set_engine("rpart")
hp_class_tree_spec <- hp_tree_spec %>%
set_mode("classification")
hp_class_tree_fit <- hp_class_tree_spec %>%
fit(win ~ k_d + assists + damage_dealt + player_spm + hill_time_s + hill_captures + hill_defends + x2_piece + x3_piece + x4_piece, data = hp2019_train)
hp_class_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
Checking confusion matrix and accuracy of the train data:
augment(hp_class_tree_fit, new_data = hp2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 1379 601
## 1 392 1179
hp_dt_accuracy <- augment(hp_class_tree_fit, new_data = hp2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
hp_dt_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.720
# library(randomForest)
# library(datasets)
# library(caret)
set.seed(306)
hp_rf <- randomForest(win ~ k_d + assists + damage_dealt + player_spm + hill_time_s + hill_captures + hill_defends + x2_piece + x3_piece + x4_piece, data = hp2019_train, ntree = 500)
print(hp_rf)
##
## Call:
## randomForest(formula = win ~ k_d + assists + damage_dealt + player_spm + hill_time_s + hill_captures + hill_defends + x2_piece + x3_piece + x4_piece, data = hp2019_train, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 26.13%
## Confusion matrix:
## 0 1 class.error
## 0 1300 471 0.2659514
## 1 457 1323 0.2567416
hp_rfAccuracy <- (1300 + 1323) / (1300 + 1323 + 457 + 471)
hp_rfAccuracy
## [1] 0.7386652
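Rather than hard-coding the counts from the printed confusion matrix, the same OOB accuracy can be computed from the fitted object. This is a sketch, relying on `randomForest` storing the OOB confusion matrix (with a trailing class.error column) in `$confusion`:

```r
# OOB accuracy straight from the stored confusion matrix
conf <- hp_rf$confusion[, 1:2]   # drop the class.error column
sum(diag(conf)) / sum(conf)
```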
importance(hp_rf)
## MeanDecreaseGini
## k_d 398.48898
## assists 143.83018
## damage_dealt 229.80142
## player_spm 289.30019
## hill_time_s 301.43364
## hill_captures 105.26674
## hill_defends 165.37544
## x2_piece 87.93843
## x3_piece 38.17524
## x4_piece 13.21070
varImpPlot(hp_rf)
# out-of-bag class probabilities from the random forest
hp_pred1 <- predict(hp_rf, type = "prob")
library(ROCR)
hp_perf = prediction(hp_pred1[,2], hp2019_train$win)
# 1. Area under curve
hp_auc = performance(hp_perf, "auc")
hp_auc
## A performance instance
## 'Area under the ROC curve'
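Printing the performance object only shows its description; the numeric value sits in the object’s `y.values` slot:

```r
# extract the numeric AUC from the ROCR performance object
hp_auc@y.values[[1]]
```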
# 2. True Positive and Negative Rate
hp_pred3 = performance(hp_perf, "tpr","fpr")
# 3. Plot the ROC curve
plot(hp_pred3, main="ROC Curve for Random Forest", col=2, lwd=2)
abline(a=0, b=1, lwd=2, lty=2, col="gray")
plot(hp_rf)
Recipe, Engine, and Workflow:
hp_recipe <- recipe(win ~ k_d + assists + damage_dealt + player_spm + hill_time_s + hill_captures + hill_defends + x2_piece + x3_piece + x4_piece, data = hp2019_train)
hp_log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
hp_log_wkflow <- workflow() %>%
add_model(hp_log_reg) %>%
add_recipe(hp_recipe)
hp_log_fit <- fit(hp_log_wkflow, hp2019_train)
hp_log_fit %>%
tidy()
## # A tibble: 11 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -5.00 0.264 -19.0 3.09e-80
## 2 k_d 4.06 0.261 15.5 1.81e-54
## 3 assists 0.183 0.0162 11.3 2.09e-29
## 4 damage_dealt -0.000254 0.0000594 -4.29 1.82e- 5
## 5 player_spm 0.00175 0.00114 1.53 1.26e- 1
## 6 hill_time_s 0.0189 0.00179 10.5 6.67e-26
## 7 hill_captures 0.0547 0.0222 2.46 1.40e- 2
## 8 hill_defends -0.206 0.0171 -12.0 2.32e-33
## 9 x2_piece -0.0422 0.0313 -1.35 1.78e- 1
## 10 x3_piece -0.213 0.0716 -2.97 2.95e- 3
## 11 x4_piece -0.330 0.187 -1.76 7.77e- 2
Assessing Model Performance:
predict(hp_log_fit, new_data = hp2019_train, type = "prob")
## # A tibble: 3,551 × 2
## .pred_0 .pred_1
## <dbl> <dbl>
## 1 0.845 0.155
## 2 0.745 0.255
## 3 0.242 0.758
## 4 0.860 0.140
## 5 0.914 0.0865
## 6 0.342 0.658
## 7 0.816 0.184
## 8 0.668 0.332
## 9 0.707 0.293
## 10 0.638 0.362
## # … with 3,541 more rows
augment(hp_log_fit, new_data = hp2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 1380 499
## 1 391 1281
augment(hp_log_fit, new_data = hp2019_train) %>%
conf_mat(truth = win, estimate = .pred_class) %>%
autoplot(type = "heatmap")
Checking accuracy:
hp_log_reg_acc <- augment(hp_log_fit, new_data = hp2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
hp_log_reg_acc
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.749
hp_svm_rbf_spec <- svm_rbf() %>%
set_mode("classification") %>%
set_engine("kernlab")
hp_svm_rbf_fit <- hp_svm_rbf_spec %>%
fit(win ~ k_d + assists + damage_dealt + player_spm +
hill_time_s + hill_captures + hill_defends +
x2_piece + x3_piece + x4_piece,
data = hp2019_train)
augment(hp_svm_rbf_fit, new_data = hp2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 1394 439
## 1 377 1341
hp_svmAccuracy <- augment(hp_svm_rbf_fit, new_data = hp2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
hp_svmAccuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.770
hp_accuracies <- c(hp_dt_accuracy$.estimate,
hp_rfAccuracy,
hp_log_reg_acc$.estimate,
hp_svmAccuracy$.estimate)
hp_accuracies
## [1] 0.7203605 0.7386652 0.7493664 0.7702056
As we can see, the SVM has the highest training accuracy of the four models.
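All four numbers above are training-set accuracies, so they likely overstate out-of-sample performance. A sketch of the held-out check for the best model, reusing the test split created earlier:

```r
# evaluate the SVM on the held-out Hardpoint test set
augment(hp_svm_rbf_fit, new_data = hp2019_test) %>%
  accuracy(truth = win, estimate = .pred_class)
```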
Splitting Data:
set.seed(1)
snd2019_split <- snd2019 %>%
initial_split(prop = 0.8, strata = "win")
snd2019_train <- training(snd2019_split)
snd2019_test <- testing(snd2019_split)
head(snd2019_train)
## # A tibble: 6 × 25
## # Groups: player [1]
## match_id team player role win kills deaths k_d assists damage_dealt
## <chr> <chr> <chr> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 737440468739… eUni… Abezy 1 0 11 7 1.57 0 1355
## 2 144010197940… eUni… Abezy 1 0 7 7 1 1 1938
## 3 259560665349… eUni… Abezy 1 0 2 7 0.29 4 796
## 4 146217688303… eUni… Abezy 1 0 7 7 1 0 1275
## 5 177954718572… eUni… Abezy 1 0 9 8 1.12 1 1561
## 6 167778343948… eUni… Abezy 1 0 10 9 1.11 3 1434
## # … with 15 more variables: player_spm <dbl>, bomb_sneak_defuses <dbl>,
## # bomb_plants <dbl>, bomb_defuses <dbl>, snd_rounds <dbl>,
## # snd_firstbloods <dbl>, snd_1_kill_round <dbl>, snd_2_kill_round <dbl>,
## # snd_3_kill_round <dbl>, snd_4_kill_round <dbl>, x2_piece <dbl>,
## # x3_piece <dbl>, x4_piece <dbl>, fb_round_ratio <dbl>, fb_avg <dbl>
dim(snd2019_train)
## [1] 2791 25
dim(snd2019_test)
## [1] 699 25
prop.table(table(snd2019_train$win))
##
## 0 1
## 0.498746 0.501254
Creating a general decision tree specification using rpart:
tree_spec <- decision_tree() %>%
set_engine("rpart")
class_tree_spec <- tree_spec %>%
set_mode("classification")
class_tree_fit <- class_tree_spec %>%
fit(win ~ k_d + role + assists + damage_dealt +
player_spm + bomb_sneak_defuses + bomb_plants +
bomb_defuses + fb_round_ratio + snd_firstbloods +
snd_1_kill_round + snd_2_kill_round + snd_3_kill_round +
snd_4_kill_round + x2_piece + x3_piece + x4_piece,
data = snd2019_train)
class_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
Checking confusion matrix and accuracy of the train data:
augment(class_tree_fit, new_data = snd2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 1099 544
## 1 293 855
class_tree <- augment(class_tree_fit, new_data = snd2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
class_tree
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.700
# library(randomForest)
# library(datasets)
# library(caret)
set.seed(306)
rf <- randomForest(win ~ k_d + role + assists + damage_dealt +
player_spm + bomb_sneak_defuses + bomb_plants +
bomb_defuses + fb_round_ratio + snd_firstbloods +
snd_1_kill_round + snd_2_kill_round + snd_3_kill_round +
snd_4_kill_round + x2_piece + x3_piece + x4_piece,
data = snd2019_train,
ntree = 500)
print(rf)
##
## Call:
## randomForest(formula = win ~ k_d + role + assists + damage_dealt + player_spm + bomb_sneak_defuses + bomb_plants + bomb_defuses + fb_round_ratio + snd_firstbloods + snd_1_kill_round + snd_2_kill_round + snd_3_kill_round + snd_4_kill_round + x2_piece + x3_piece + x4_piece, data = snd2019_train, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 29.7%
## Confusion matrix:
## 0 1 class.error
## 0 1032 360 0.2586207
## 1 469 930 0.3352395
rfAccuracy <- (1032 + 930) / (1032 + 930 + 360 + 469)
rfAccuracy
## [1] 0.7029738
importance(rf)
## MeanDecreaseGini
## k_d 310.6887002
## role 49.0524625
## assists 104.8659545
## damage_dealt 216.9071156
## player_spm 237.4322880
## bomb_sneak_defuses 5.8702582
## bomb_plants 52.2779181
## bomb_defuses 24.2341082
## fb_round_ratio 80.0289794
## snd_firstbloods 37.7644198
## snd_1_kill_round 82.4309082
## snd_2_kill_round 55.7139776
## snd_3_kill_round 26.4644834
## snd_4_kill_round 9.5438101
## x2_piece 41.6299549
## x3_piece 8.1996818
## x4_piece 0.8748542
varImpPlot(rf)
# out-of-bag class probabilities from the random forest
pred1 <- predict(rf, type = "prob")
library(ROCR)
perf = prediction(pred1[,2], snd2019_train$win)
# 1. Area under curve
auc = performance(perf, "auc")
auc
## A performance instance
## 'Area under the ROC curve'
# 2. True Positive and Negative Rate
pred3 = performance(perf, "tpr","fpr")
# 3. Plot the ROC curve
plot(pred3, main = "ROC Curve for Random Forest", col = 2, lwd = 2)
abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")
Recipe, Engine, and Workflow:
snd_recipe <- recipe(win ~ k_d + role + assists + damage_dealt +
player_spm + bomb_sneak_defuses + bomb_plants +
bomb_defuses + fb_round_ratio + snd_firstbloods +
snd_1_kill_round + snd_2_kill_round + snd_3_kill_round +
snd_4_kill_round + x2_piece + x3_piece + x4_piece,
data = snd2019_train)
log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
log_wkflow <- workflow() %>%
add_model(log_reg) %>%
add_recipe(snd_recipe)
log_fit <- fit(log_wkflow, snd2019_train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
log_fit %>%
tidy()
## # A tibble: 19 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.49 0.181 -8.23 1.93e-16
## 2 k_d 5.22 0.292 17.8 2.93e-71
## 3 role2 -0.141 0.105 -1.34 1.81e- 1
## 4 role3 -0.119 0.132 -0.901 3.67e- 1
## 5 assists 0.375 0.0367 10.2 1.49e-24
## 6 damage_dealt -0.000364 0.000168 -2.17 3.03e- 2
## 7 player_spm -0.0168 0.00372 -4.50 6.78e- 6
## 8 bomb_sneak_defuses 0.0981 0.307 0.320 7.49e- 1
## 9 bomb_plants 0.229 0.0547 4.19 2.79e- 5
## 10 bomb_defuses 0.171 0.127 1.35 1.76e- 1
## 11 fb_round_ratio -11.9 2.80 -4.26 2.04e- 5
## 12 snd_firstbloods 1.44 0.304 4.73 2.30e- 6
## 13 snd_1_kill_round -0.422 0.0564 -7.48 7.46e-14
## 14 snd_2_kill_round -1.02 0.115 -8.88 6.69e-19
## 15 snd_3_kill_round -1.61 0.186 -8.66 4.78e-18
## 16 snd_4_kill_round -2.65 0.324 -8.18 2.74e-16
## 17 x2_piece 0.206 0.0808 2.54 1.09e- 2
## 18 x3_piece 0.213 0.257 0.829 4.07e- 1
## 19 x4_piece 0.111 0.876 0.126 8.99e- 1
Assessing Model Performance:
predict(log_fit, new_data = snd2019_train, type = "prob")
## # A tibble: 2,791 × 2
## .pred_0 .pred_1
## <dbl> <dbl>
## 1 0.556 0.444
## 2 0.716 0.284
## 3 0.696 0.304
## 4 0.635 0.365
## 5 0.758 0.242
## 6 0.816 0.184
## 7 0.461 0.539
## 8 0.732 0.268
## 9 0.409 0.591
## 10 0.438 0.562
## # … with 2,781 more rows
augment(log_fit, new_data = snd2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 1138 477
## 1 254 922
augment(log_fit, new_data = snd2019_train) %>%
conf_mat(truth = win, estimate = .pred_class) %>%
autoplot(type = "heatmap")
Checking accuracy:
log_reg_acc <- augment(log_fit, new_data = snd2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
log_reg_acc
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.738
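As a sanity check, the 0.738 training accuracy can be recovered by hand from the confusion matrix above (correct predictions over total observations):

```r
# (true 0 predictions + true 1 predictions) / all training observations
(1138 + 922) / (1138 + 922 + 477 + 254)
## [1] 0.7380867
```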
svm_rbf_spec <- svm_rbf() %>%
set_mode("classification") %>%
set_engine("kernlab")
svm_rbf_fit <- svm_rbf_spec %>%
fit(win ~ k_d + role + assists + damage_dealt +
player_spm + bomb_sneak_defuses + bomb_plants +
bomb_defuses + fb_round_ratio + snd_firstbloods +
snd_1_kill_round + snd_2_kill_round + snd_3_kill_round +
snd_4_kill_round + x2_piece + x3_piece + x4_piece,
data = snd2019_train)
augment(svm_rbf_fit, new_data = snd2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 1173 436
## 1 219 963
svmAccuracy <- augment(svm_rbf_fit, new_data = snd2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
svmAccuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.765
snd_accuracies <- c(class_tree$.estimate,
rfAccuracy,
log_reg_acc$.estimate,
svmAccuracy$.estimate)
snd_accuracies
## [1] 0.7001075 0.7029738 0.7380867 0.7653171
Comparing the four Search and Destroy models, the SVM has the highest training accuracy, at roughly 0.765.
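For readability, the same comparison can be made with a named vector (the model labels are mine, matching the order in which the accuracies were collected above):

```r
# Label each training accuracy with its model
snd_accuracies <- c(decision_tree = 0.7001075,
                    random_forest = 0.7029738,
                    logistic_reg  = 0.7380867,
                    svm_rbf       = 0.7653171)
names(which.max(snd_accuracies))
## [1] "svm_rbf"
```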
Splitting Data:
set.seed(1)
control2019_split <- control2019 %>%
initial_split(prop = 0.8, strata = "win")
control2019_train <- training(control2019_split)
control2019_test <- testing(control2019_split)
head(control2019_train)
## player role win k_d assists damage_dealt player_spm x2_piece x3_piece
## 4 Abezy 1 0 1.30 17 6459 386.2 5 1
## 16 Abezy 1 0 0.88 4 5353 215.9 2 0
## 18 Abezy 1 0 1.00 3 2657 285.7 3 0
## 21 Abezy 1 0 0.70 4 3077 225.0 1 0
## 22 Abezy 1 0 0.75 5 3357 217.5 2 1
## 23 Abezy 1 0 0.78 5 4315 215.3 1 1
## x4_piece ctrl_firstbloods ctrl_firstdeaths ctrl_captures
## 4 0 1 0 4
## 16 0 1 0 2
## 18 0 0 0 1
## 21 0 0 0 1
## 22 0 1 1 1
## 23 0 1 0 2
dim(control2019_train)
## [1] 2122 13
dim(control2019_test)
## [1] 532 13
prop.table(table(control2019_train$win))
##
## 0 1
## 0.4995287 0.5004713
Creating a general decision tree specification using rpart:
control_tree_spec <- decision_tree() %>%
set_engine("rpart")
control_class_tree_spec <- control_tree_spec %>%
set_mode("classification")
control_class_tree_fit <- control_class_tree_spec %>%
fit(win ~ k_d + role + assists + damage_dealt +
player_spm + x2_piece + x3_piece + x4_piece +
ctrl_firstbloods + ctrl_firstdeaths + ctrl_captures,
data = control2019_train)
control_class_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
Checking the confusion matrix and accuracy on the training data:
augment(control_class_tree_fit, new_data = control2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 742 250
## 1 318 812
control_dt_accuracy <- augment(control_class_tree_fit, new_data = control2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
control_dt_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.732
control_rf <- randomForest(win ~ k_d + role + assists + damage_dealt +
player_spm + x2_piece + x3_piece + x4_piece +
ctrl_firstbloods + ctrl_firstdeaths + ctrl_captures,
data = control2019_train,
ntree = 500)
print(control_rf)
##
## Call:
## randomForest(formula = win ~ k_d + role + assists + damage_dealt + player_spm + x2_piece + x3_piece + x4_piece + ctrl_firstbloods + ctrl_firstdeaths + ctrl_captures, data = control2019_train, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 24.65%
## Confusion matrix:
## 0 1 class.error
## 0 803 257 0.2424528
## 1 266 796 0.2504708
control_rfAccuracy <- (803 + 796) / (803 + 796 + 257 + 266)
control_rfAccuracy
## [1] 0.7535344
importance(control_rf)
## MeanDecreaseGini
## k_d 250.908459
## role 35.679172
## assists 102.439185
## damage_dealt 172.520556
## player_spm 222.893690
## x2_piece 62.155880
## x3_piece 22.170488
## x4_piece 6.380112
## ctrl_firstbloods 28.113423
## ctrl_firstdeaths 31.784533
## ctrl_captures 112.493422
varImpPlot(control_rf)
control_pred1 = predict(control_rf, type = "prob")
library(ROCR)
control_pred2 = prediction(control_pred1[,2], control2019_train$win)
# 1. Area under curve
control_auc = performance(control_pred2, "auc")
control_auc
## A performance instance
## 'Area under the ROC curve'
# 2. True Positive and Negative Rate
control_pred3 = performance(control_pred2, "tpr","fpr")
# 3. Plot the ROC curve
plot(control_pred3, main="ROC Curve for Random Forest", col=2, lwd=2)
abline(a=0, b=1, lwd=2, lty=2, col="gray")
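Printing the AUC object only reports that it is "A performance instance"; the numeric area itself lives in the object's `y.values` slot. A self-contained toy sketch (hypothetical scores and labels, not the project data):

```r
library(ROCR)
# Toy predicted scores and true 0/1 labels, purely for illustration
toy_scores <- c(0.9, 0.4, 0.35, 0.7, 0.2, 0.6)
toy_labels <- c(1, 0, 1, 1, 0, 0)
toy_pred <- prediction(toy_scores, toy_labels)
toy_auc <- performance(toy_pred, "auc")
toy_auc@y.values[[1]]  # the numeric AUC, here 7/9 (~0.778)
```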
Recipe, Engine, and Workflow:
control_recipe <- recipe(win ~ k_d + role + assists + damage_dealt +
player_spm + x2_piece + x3_piece + x4_piece +
ctrl_firstbloods + ctrl_firstdeaths + ctrl_captures,
data = control2019_train)
control_log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
control_log_wkflow <- workflow() %>%
add_model(control_log_reg) %>%
add_recipe(control_recipe)
control_log_fit <- fit(control_log_wkflow, control2019_train)
control_log_fit %>%
tidy()
## # A tibble: 13 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.67 0.327 -14.3 3.12e-46
## 2 k_d 4.14 0.316 13.1 2.74e-39
## 3 role2 -0.0116 0.128 -0.0912 9.27e- 1
## 4 role3 -0.0935 0.154 -0.606 5.44e- 1
## 5 assists 0.192 0.0222 8.69 3.75e-18
## 6 damage_dealt -0.000466 0.0000680 -6.86 7.10e-12
## 7 player_spm 0.000260 0.00176 0.148 8.83e- 1
## 8 x2_piece -0.124 0.0454 -2.74 6.14e- 3
## 9 x3_piece -0.0678 0.117 -0.579 5.62e- 1
## 10 x4_piece -0.142 0.337 -0.422 6.73e- 1
## 11 ctrl_firstbloods 0.0426 0.0906 0.471 6.38e- 1
## 12 ctrl_firstdeaths 0.231 0.0878 2.63 8.58e- 3
## 13 ctrl_captures 0.583 0.0546 10.7 1.42e-26
Assessing Model Performance:
predict(control_log_fit, new_data = control2019_train, type = "prob")
## # A tibble: 2,122 × 2
## .pred_0 .pred_1
## <dbl> <dbl>
## 1 0.0598 0.940
## 2 0.850 0.150
## 3 0.713 0.287
## 4 0.872 0.128
## 5 0.828 0.172
## 6 0.805 0.195
## 7 0.615 0.385
## 8 0.510 0.490
## 9 0.528 0.472
## 10 0.333 0.667
## # … with 2,112 more rows
augment(control_log_fit, new_data = control2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 842 272
## 1 218 790
augment(control_log_fit, new_data = control2019_train) %>%
conf_mat(truth = win, estimate = .pred_class) %>%
autoplot(type = "heatmap")
Checking accuracy:
control_log_reg_acc <- augment(control_log_fit, new_data = control2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
control_log_reg_acc
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.769
control_svm_rbf_spec <- svm_rbf() %>%
set_mode("classification") %>%
set_engine("kernlab")
control_svm_rbf_fit <- control_svm_rbf_spec %>%
fit(win ~ k_d + role + assists + damage_dealt +
player_spm + x2_piece + x3_piece + x4_piece +
ctrl_firstbloods + ctrl_firstdeaths + ctrl_captures,
data = control2019_train)
augment(control_svm_rbf_fit, new_data = control2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 883 227
## 1 177 835
control_svmAccuracy <- augment(control_svm_rbf_fit, new_data = control2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
control_svmAccuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.810
control_accuracies <- c(control_dt_accuracy$.estimate,
control_rfAccuracy,
control_log_reg_acc$.estimate,
control_svmAccuracy$.estimate)
control_accuracies
## [1] 0.7323280 0.7535344 0.7690858 0.8096136
Once again, the SVM model has the highest training accuracy, at roughly 0.810.
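Since all of these accuracies are computed on the training data, a natural next step is to score the best-performing model on the held-out test set to check for overfitting. A sketch using the objects fit above (not run here):

```r
# Evaluate the chosen SVM on the held-out Control test set
augment(control_svm_rbf_fit, new_data = control2019_test) %>%
  accuracy(truth = win, estimate = .pred_class)
```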